Overview

Dataset statistics

Number of variables: 11
Number of observations: 699
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 8
Duplicate rows (%): 1.1%
Total size in memory: 98.8 KiB
Average record size in memory: 144.7 B
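The two derived figures in this table can be checked by hand. The sketch below assumes the report divides the duplicate count and the total memory size by the number of observations; the variable names are illustrative and not part of pandas-profiling.

```python
# Figures copied from the "Dataset statistics" table above.
n_rows = 699
n_duplicates = 8
total_kib = 98.8

# Share of duplicate rows, in percent.
duplicate_pct = 100 * n_duplicates / n_rows

# Average bytes per record (1 KiB = 1024 bytes).
avg_record_bytes = total_kib * 1024 / n_rows

print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print(f"Average record size in memory: {avg_record_bytes:.1f} B")
```

Both results round to the values the report shows.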

Variable types

NUM: 9
BOOL: 1
CAT: 1

Reproduction

Analysis started: 2020-05-05 09:45:19.504973
Analysis finished: 2020-05-05 09:45:36.493136
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml
Dataset has 8 (1.1%) duplicate rows (Duplicates)
Uniformity of Cell Shape is highly correlated with Uniformity of Cell Size (High Correlation)
Uniformity of Cell Size is highly correlated with Uniformity of Cell Shape (High Correlation)

Variables

ID
Real number (ℝ≥0)

Distinct count: 645
Unique (%): 92.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1071704.099
Minimum: 61634
Maximum: 13454352
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 61634
5-th percentile: 411453
Q1: 870688.5
median: 1171710
Q3: 1238298
95-th percentile: 1333890.8
Maximum: 13454352
Range: 13392718
Interquartile range (IQR): 367609.5

Descriptive statistics

Standard deviation: 617095.7298
Coefficient of variation (CV): 0.5758079404
Kurtosis: 257.7171591
Mean: 1071704.099
Median Absolute Deviation (MAD): 250872.2069
Skewness: 13.67532594
Sum: 749121165
Variance: 3.808071398e+11
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 61634. 317551.5 635987. 897321.5 1001025. ... 1117595.5 1165543.5 1242021.5 1371473. 13454352. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1182404 | 6 | 0.9%
1276091 | 5 | 0.7%
1198641 | 3 | 0.4%
466906 | 2 | 0.3%
1116116 | 2 | 0.3%
1070935 | 2 | 0.3%
385103 | 2 | 0.3%
1293439 | 2 | 0.3%
1240603 | 2 | 0.3%
1277792 | 2 | 0.3%
Other values (635) | 671 | 96.0%

Value | Count | Frequency (%)
61634 | 1 | 0.1%
63375 | 1 | 0.1%
76389 | 1 | 0.1%
95719 | 1 | 0.1%
128059 | 1 | 0.1%

Value | Count | Frequency (%)
13454352 | 1 | 0.1%
8233704 | 1 | 0.1%
1371920 | 1 | 0.1%
1371026 | 1 | 0.1%
1369821 | 1 | 0.1%

Clump Thickness
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.417739628
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 4
Q3: 6
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.815740659
Coefficient of variation (CV): 0.6373713473
Kurtosis: -0.6237154123
Mean: 4.417739628
Median Absolute Deviation (MAD): 2.297551581
Skewness: 0.5928585327
Sum: 3088
Variance: 7.928395456
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 2.5 4.5 5.5 8.5 9.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1 | 145 | 20.7%
5 | 130 | 18.6%
3 | 108 | 15.5%
4 | 80 | 11.4%
10 | 69 | 9.9%
2 | 50 | 7.2%
8 | 46 | 6.6%
6 | 34 | 4.9%
7 | 23 | 3.3%
9 | 14 | 2.0%

Value | Count | Frequency (%)
1 | 145 | 20.7%
2 | 50 | 7.2%
3 | 108 | 15.5%
4 | 80 | 11.4%
5 | 130 | 18.6%

Value | Count | Frequency (%)
10 | 69 | 9.9%
9 | 14 | 2.0%
8 | 46 | 6.6%
7 | 23 | 3.3%
6 | 34 | 4.9%

Uniformity of Cell Size
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.134477825
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 3.05145911
Coefficient of variation (CV): 0.9735143395
Kurtosis: 0.09880288537
Mean: 3.134477825
Median Absolute Deviation (MAD): 2.511255605
Skewness: 1.233136558
Sum: 2191
Variance: 9.3114027
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 4.5 8.5 9.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1 | 384 | 54.9%
10 | 67 | 9.6%
3 | 52 | 7.4%
2 | 45 | 6.4%
4 | 40 | 5.7%
5 | 30 | 4.3%
8 | 29 | 4.1%
6 | 27 | 3.9%
7 | 19 | 2.7%
9 | 6 | 0.9%

Value | Count | Frequency (%)
1 | 384 | 54.9%
2 | 45 | 6.4%
3 | 52 | 7.4%
4 | 40 | 5.7%
5 | 30 | 4.3%

Value | Count | Frequency (%)
10 | 67 | 9.6%
9 | 6 | 0.9%
8 | 29 | 4.1%
7 | 19 | 2.7%
6 | 27 | 3.9%

Uniformity of Cell Shape
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.207439199
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.971912767
Coefficient of variation (CV): 0.9265686995
Kurtosis: 0.007010980047
Mean: 3.207439199
Median Absolute Deviation (MAD): 2.466613863
Skewness: 1.161859179
Sum: 2242
Variance: 8.832265496
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 4.5 8.5 9.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1 | 353 | 50.5%
2 | 59 | 8.4%
10 | 58 | 8.3%
3 | 56 | 8.0%
4 | 44 | 6.3%
5 | 34 | 4.9%
7 | 30 | 4.3%
6 | 30 | 4.3%
8 | 28 | 4.0%
9 | 7 | 1.0%

Value | Count | Frequency (%)
1 | 353 | 50.5%
2 | 59 | 8.4%
3 | 56 | 8.0%
4 | 44 | 6.3%
5 | 34 | 4.9%

Value | Count | Frequency (%)
10 | 58 | 8.3%
9 | 7 | 1.0%
8 | 28 | 4.0%
7 | 30 | 4.3%
6 | 30 | 4.3%

Marginal Adhesion
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.806866953
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.855379239
Coefficient of variation (CV): 1.017283429
Kurtosis: 0.9879470695
Mean: 2.806866953
Median Absolute Deviation (MAD): 2.238034715
Skewness: 1.524468091
Sum: 1962
Variance: 8.1531906
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 3.5 8.5 9.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1 | 407 | 58.2%
3 | 58 | 8.3%
2 | 58 | 8.3%
10 | 55 | 7.9%
4 | 33 | 4.7%
8 | 25 | 3.6%
5 | 23 | 3.3%
6 | 22 | 3.1%
7 | 13 | 1.9%
9 | 5 | 0.7%

Value | Count | Frequency (%)
1 | 407 | 58.2%
2 | 58 | 8.3%
3 | 58 | 8.3%
4 | 33 | 4.7%
5 | 23 | 3.3%

Value | Count | Frequency (%)
10 | 55 | 7.9%
9 | 5 | 0.7%
8 | 25 | 3.6%
7 | 13 | 1.9%
6 | 22 | 3.1%

Single Epithelial Cell Size
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.21602289
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 2
Q3: 4
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 2.214299887
Coefficient of variation (CV): 0.6885211836
Kurtosis: 2.169066423
Mean: 3.21602289
Median Absolute Deviation (MAD): 1.685526636
Skewness: 1.712171802
Sum: 2248
Variance: 4.903123988
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 2.5 3.5 6.5 8.5 9.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
2 | 386 | 55.2%
3 | 72 | 10.3%
4 | 48 | 6.9%
1 | 47 | 6.7%
6 | 41 | 5.9%
5 | 39 | 5.6%
10 | 31 | 4.4%
8 | 21 | 3.0%
7 | 12 | 1.7%
9 | 2 | 0.3%

Value | Count | Frequency (%)
1 | 47 | 6.7%
2 | 386 | 55.2%
3 | 72 | 10.3%
4 | 48 | 6.9%
5 | 39 | 5.6%

Value | Count | Frequency (%)
10 | 31 | 4.4%
9 | 2 | 0.3%
8 | 21 | 3.0%
7 | 12 | 1.7%
6 | 41 | 5.9%

Bare Nuclei
Categorical

Distinct count: 11
Unique (%): 1.6%
Missing: 0
Missing (%): 0.0%
Memory size: 5.6 KiB
Value | Count
1 | 402
10 | 132
2 | 30
5 | 30
3 | 28
Other values (6) | 77
Value | Count | Frequency (%)
1 | 402 | 57.5%
10 | 132 | 18.9%
2 | 30 | 4.3%
5 | 30 | 4.3%
3 | 28 | 4.0%
8 | 21 | 3.0%
4 | 19 | 2.7%
? | 16 | 2.3%
9 | 9 | 1.3%
7 | 8 | 1.1%

Length

Max length: 2
Mean length: 1.188841202
Min length: 1

Value | Count | Frequency (%)
Decimal_Number | 10 | 90.9%
Other_Punctuation | 1 | 9.1%

Value | Count | Frequency (%)
Common | 11 | 100.0%

Value | Count | Frequency (%)
ASCII | 11 | 100.0%

Bland Chromatin
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.43776824
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 5
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.438364252
Coefficient of variation (CV): 0.7092869798
Kurtosis: 0.1846213115
Mean: 3.43776824
Median Absolute Deviation (MAD): 1.94976269
Skewness: 1.099969082
Sum: 2403
Variance: 5.945620227
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 3.5 5.5 6.5 7.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
2 | 166 | 23.7%
3 | 165 | 23.6%
1 | 152 | 21.7%
7 | 73 | 10.4%
4 | 40 | 5.7%
5 | 34 | 4.9%
8 | 28 | 4.0%
10 | 20 | 2.9%
9 | 11 | 1.6%
6 | 10 | 1.4%

Value | Count | Frequency (%)
1 | 152 | 21.7%
2 | 166 | 23.7%
3 | 165 | 23.6%
4 | 40 | 5.7%
5 | 34 | 4.9%

Value | Count | Frequency (%)
10 | 20 | 2.9%
9 | 11 | 1.6%
8 | 28 | 4.0%
7 | 73 | 10.4%
6 | 10 | 1.4%

Normal Nucleoli
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.86695279
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 3.053633894
Coefficient of variation (CV): 1.065114816
Kurtosis: 0.4742686755
Mean: 2.86695279
Median Absolute Deviation (MAD): 2.45570926
Skewness: 1.422261257
Sum: 2004
Variance: 9.324679956
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 3.5 9.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1 | 443 | 63.4%
10 | 61 | 8.7%
3 | 44 | 6.3%
2 | 36 | 5.2%
8 | 24 | 3.4%
6 | 22 | 3.1%
5 | 19 | 2.7%
4 | 18 | 2.6%
9 | 16 | 2.3%
7 | 16 | 2.3%

Value | Count | Frequency (%)
1 | 443 | 63.4%
2 | 36 | 5.2%
3 | 44 | 6.3%
4 | 18 | 2.6%
5 | 19 | 2.7%

Value | Count | Frequency (%)
10 | 61 | 8.7%
9 | 16 | 2.3%
8 | 24 | 3.4%
7 | 16 | 2.3%
6 | 22 | 3.1%

Mitoses
Real number (ℝ≥0)

Distinct count: 9
Unique (%): 1.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.589413448
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.6 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 1
95-th percentile: 5
Maximum: 10
Range: 9
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.715077943
Coefficient of variation (CV): 1.07906344
Kurtosis: 12.65787807
Mean: 1.589413448
Median Absolute Deviation (MAD): 0.9764531796
Skewness: 3.560657844
Sum: 1111
Variance: 2.941492349
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 3.5 10. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
1 | 579 | 82.8%
2 | 35 | 5.0%
3 | 33 | 4.7%
10 | 14 | 2.0%
4 | 12 | 1.7%
7 | 9 | 1.3%
8 | 8 | 1.1%
5 | 6 | 0.9%
6 | 3 | 0.4%

Value | Count | Frequency (%)
1 | 579 | 82.8%
2 | 35 | 5.0%
3 | 33 | 4.7%
4 | 12 | 1.7%
5 | 6 | 0.9%

Value | Count | Frequency (%)
10 | 14 | 2.0%
8 | 8 | 1.1%
7 | 9 | 1.3%
6 | 3 | 0.4%
5 | 6 | 0.9%

Class
Boolean

Distinct count: 2
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 5.6 KiB
Value | Count | Frequency (%)
1 | 458 | 65.5%
0 | 241 | 34.5%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, so for a perfectly linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
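The calculation described above can be sketched in plain Python. This is an illustration only, not the code pandas-profiling runs; the function name `pearson_r` and the sample data are hypothetical, and the sample (n - 1) normalisation is an assumption that cancels out of the final ratio.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (n - 1 in the denominator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)

xs = [1, 2, 3, 4, 5]
ys_up = [3, 5, 7, 9, 11]     # y = 2x + 1, perfectly linear and increasing
ys_down = [10, 8, 6, 4, 2]   # perfectly linear and decreasing

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
# Shifting and rescaling x leaves r unchanged.
r_scaled = pearson_r([10 * x + 3 for x in xs], ys_up)
```

A constant column has zero standard deviation, so this sketch would raise a `ZeroDivisionError` for it; r is undefined in that case.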

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic correlations. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
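As a rough sketch of that recipe, the snippet below ranks both variables and then applies the Pearson formula to the ranks. Giving tied values the average of their positions is an assumption (a common convention), and the helper names `average_ranks`, `pearson_r` and `spearman_rho` are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson's r; the 1/(n-1) factors cancel, so raw sums are used."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear

rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
```

Here ρ is 1 because the relationship is perfectly monotonic, while r stays below 1 because it is not linear.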

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.

Missing values

Sample

First rows

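Row | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1
1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 1
2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 1
3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 1
4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 1
5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 0
6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 1
7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 1
8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 1
9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 1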

Last rows

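Row | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
689 | 654546 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 8 | 1
690 | 654546 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1 | 1
691 | 695091 | 5 | 10 | 10 | 5 | 4 | 5 | 4 | 4 | 1 | 0
692 | 714039 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1
693 | 763235 | 3 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1
694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1
695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1
696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 0
697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 0
698 | 897471 | 4 | 8 | 8 | 5 | 4 | 5 | 10 | 4 | 1 | 0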